-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixing parquet list of struct interpretation #13715
Fixing parquet list of struct interpretation #13715
Conversation
…quet_list_of_struct
I tried testing this PR with the Spark plugin, but Apache Spark thinks the schema of the repeated_no_annotation.parquet file is:
I think the issue is that there's no logical type annotation on |
…c2346/cudf into mwilson/parquet_list_of_struct
This is really close now, but not 100%. expected:
produced:
Note the null difference on the inner list. |
…quet_list_of_struct
…c2346/cudf into mwilson/parquet_list_of_struct
Spark integration tests show a few errors like More investigation is warranted here. |
…quet_list_of_struct
Integration tests for spark are passing now and this is ready for review. |
/ok to test |
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good; few questions.
@@ -175,6 +175,81 @@ type_id to_type_id(SchemaElement const& schema, | |||
return type_id::EMPTY; | |||
} | |||
|
|||
void metadata::sanitize_schema() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neat approach, localizes the madness 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could ultimately fix things like single-level list as well. I like this approach too.
…c2346/cudf into mwilson/parquet_list_of_struct
/ok to test |
/merge |
Description
This change alters how we interpret non-annotated data in a parquet file. Most modern parquet writers would produce something like:
But the list annotation isn't required. If it didn't exist, we would incorrectly interpret this schema as a struct of struct and not a list of struct. This change alters the code to look at the child and see if it is repeated. If it is, this indicates a list.
closes #13664
Checklist